Corpora in Translation: A Slovene Perspective
Abstract
This paper reviews the use of corpora in translation practice and translator training, focusing on currently available monolingual and multilingual language resources for Slovene. The first part of the paper briefly outlines the state of the art in corpus linguistics and then introduces publicly available corpora for Slovene, including general and special language corpora as well as several parallel corpora. The advantages and potential pitfalls of using the web as a corpus are also discussed. Part two presents some important considerations and guidelines for using corpora in both training situations and, more specifically, real-world translation projects. In many respects, corpora represent a richer source of information for a translator than dictionaries; on the other hand, a corpus user must know how to critically interpret the results obtained via a corpus query. Because corpora may not be readily available for many special domains and/or language pairs, procedures and tools for compiling one’s own corpora are also described.
Similar resources
Slovene-English Datasets for MT
Advances in machine translation are becoming increasingly dependent on the availability of large scale language resources, in particular parallel corpora. The talk presents Slovene-English language resources that were developed as datasets for translation studies and machine learning programs. Three parallel datasets are introduced: the MULTEXT-East multilingual word-annotated corpus, the IJS-E...
The ELAN Slovene-English Aligned Corpus
Multilingual parallel corpora are a basic resource for research and development of MT. Such corpora are still scarce, especially for lower-diffusion languages. The paper presents a sentence-aligned tokenised Slovene-English corpus, developed in the scope of the EU ELAN project. The corpus contains 1 million words from fifteen recent terminology-rich texts and is encoded according to the Guideli...
The JOS Morphosyntactically Tagged Corpus of Slovene
The JOS morphosyntactic resources for Slovene consist of the specifications, lexicon, and two corpora: jos100k, a 100,000 word balanced monolingual sampled corpus annotated with hand validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, the 1 million word partially hand validated corpus. The two corpora have been sampled from the 600M word Slovene reference corpus FidaPLUS. The J...
Using Multilingual Resources for Building SloWNet Faster
This project report presents the results of an approach in which synsets for Slovene wordnet were induced automatically from parallel corpora and already existing wordnets. First, multilingual lexicons were obtained from word-aligned corpora and compared to the wordnets in various languages in order to disambiguate lexicon entries. Then appropriate synset ids were attached to Slovene entries fr...
Bootstrapping Bilingual Lexicons from Comparable Corpora for Closely Related Languages
In this paper we present an approach to bootstrap a Croatian-Slovene bilingual lexicon from comparable news corpora from scratch, without relying on any external bilingual knowledge resource. Instead of using a dictionary to translate context vectors, we build a seed lexicon from identical words in both languages and extend it with context-based cognates and translation candidates of the most f...
Publication date: 2008